3 From Big to Democratic Data 


Why the Rise of AI Needs 
Data Solidarity 


Mercedes Bunz and Photini Vrikki 


Digital technologies and their processing of data have transformed 
our cultural, social and working lives through expansive digital con- 
nections and networks, allowing us to undertake social, cultural and 
economic transactions that shape global and local communities. This 
digital space is a sphere in which users interact, thereby creating data,
which is then collected and analyzed, shaping their societal possibilities
through recommendations or algorithmic decision-making. Yet,
paradoxically, in spite of the ubiquitous reach of our digital condition,
the political force within data shaping our societies is only partly
understood. One reason for this is that the notion of “big data” was
conceptualized at the beginning of the 21st century by businesses
and for the business world, as Puschmann and Burgess (2014) have
shown. Given the significance of data in our public and everyday lives,
many find the strong, confining link between data and business alarm- 
ing; this is even more so, since data has gained societal and political 
importance through further technical developments in areas such as 
artificial intelligence (AI). As we will show in this text, recent advances 
in AI, particularly in the area of machine learning (ML; in which sys- 
tems are trained on huge datasets), have opened up new possibilities 
for data analysis that have further strengthened the societal role of 
data in our political and social lives. This is why data needs to be un- 
derstood more than ever not just as an economic opportunity but also 
as a democratic frontier. 

When discussing data from the perspective of democracy, next to 
the rights of the individual and the effect of data on the individual, the 
effect of data on the collective, i.e., the shaping of a society, comes into 
view. Recently, a range of scholars have started to explore this collective 
value of datasets systematically and have shown that value for popu- 
lations can be gained from insights into data relations emerging be- 
tween individual data entries (Viljoen, 2021). This point is important, 


DOI: 10.4324/9781003173427-3 


as it highlights the power datasets have to drive benefits for societies 
(and not just companies), widely known as “data for the public good”, 
which some argue could be governed by independent data trusts (see 
Delacroix, Pineau & Montgomery, 2021), a construct that is somewhat 
linked to the notion of digital commons (Dulong de Rosnay & Stalder, 
2020). Such research into data trusts or digital commons stresses the 
collective value of data and calls for revisiting the principles of data 
governance, i.e., the processes that manage the availability, usability 
and security of data. Among these three aspects, it was the last, the
aspect of security and the loss of privacy leading to growing surveillance
(Zuboff, 2019), that gained most public attention at the beginning of
the 21st century, with some positive effects. A variety of governments
have tackled this issue through legislative amendments, one of the
most far-reaching being Europe’s General Data Protection Regulation
(GDPR). The principles of availability and usability, however, were
likewise discussed beyond experts and data science. Both principles
have gained the attention of data activists, non-governmental organizations
(NGOs) and even politicians, an attention that is now newly
required. In his excellent genealogy of Open Data, Jonathan Gray
(2014) has shown the wide range of initiatives Open Data has brought
to the surface, from neoliberal takes to widening civic participation. Among them,
we find calls for: 


- opening data in a push for transparency to hold the public sector 
to account; 

- reducing government by transforming it to a platform service; 

- making data available that could be useful for businesses, thereby
fostering economic growth and innovation; 

- allowing citizens to reuse their data and/or to make their data 
portable from one platform to another; and 

- making use of data to advance societal issues through civic hacking. 


While the benefits and drawbacks of the above points are still being 
discussed, the focus on opening up data has recently shifted. This 
shift is an effect of two, at times overlapping, strands of research 
transforming data analysis profoundly: (1) the growing body of criti- 
cal research into the bias of datasets and (2) the development of data 
analytics through the method of ML. Both strands put new attention 
on the quality of datasets, which has not only become essential but 
also opens up room for the creation of datasets as a societal tool with 
strong political potential, which is the focus of this chapter. And while 
there is a growing body of ethics codes in different domains (Stark & 
Hoffmann, 2019) as well as calls for “data infrastructure literacy”
(Gray et al., 2018), computational science has for far too long neglected
questions about the creation, composition and processing of
data. In other words, despite calls to move toward critical data studies
(Iliadis & Russo, 2016), much of our data practice, particularly
regarding ML, has been kept invisible. Our chapter will show how 
this invisibility, which endangers the quality of data, could be chal- 
lenged if we deployed data solidarity as a principle of governance for 
the creation of datasets; a principle that could help governments and 
corporations understand datasets not just as economic opportunities 
but also as democratic resources that offer possibilities to advance the 
public good. 


On the Link between Data Quantity and Data Quality 


Ever since digital technologies transformed data into what
has been called “big data” (Kitchin, 2014), i.e., extremely large data
sets that can be analyzed computationally to reveal patterns, trends
and associations, new opportunities but also profound challenges
regarding the quality of datasets have arisen. Data has become a resource of
social life leading to digital technology and sociality becoming tightly 
interwoven, at times inseparable (Marres, 2017: 7-44). With this, sub- 
stantial problems around the quality of datasets and their handling 
became apparent and have started to be discussed by a wide range of 
scholars. Contributing to critical data science, danah boyd and Kate 
Crawford (2012: 666 and 668) have, e.g., shown that bigger data is not 
automatically better data and that early “claims to objectivity and ac- 
curacy” were misleading. Ruha Benjamin (2019: 127) has pointed out 
that datasets are often “naturally occurring” within digital industries 
and are therefore taken from contexts that “reflect deeply ingrained 
cultural prejudices and structural hierarchies”. The far reach of 
those ingrained prejudices was further elaborated by Wendy Chun 
(2021: 17), who showed in her excellent study Discriminating Data that 
even when ML algorithms do not officially include race as a category, 
unbalanced datasets embed whiteness as a default. Besides racial bias, 
the digital sphere is also haunted by class (Schradie, 2011) and gender
gaps. The latter were exposed by Caroline Criado Perez (2019), who
describes the discrimination against women through data as systemic:
an invisible bias has a profound effect on women’s lives (e.g.,
there have long been life-threatening knowledge gaps within medical
data about women’s heart attacks, which manifest in slightly different
symptoms from those of men, on whom research into this disease was long
focused). Catherine D’Ignazio and Lauren F. Klein (2020) have also
made a strong case, showing in example after example how profoundly
data science needs feminism. Many of the above studies are interdisciplinary,
drawing on important works within Computer Science such
as the critical study into word embedding in natural language process- 
ing (Bolukbasi et al., 2016) or into the bias of large language models 
(Bender et al., 2021). 

Being aware of such problems when gathering datasets or working 
with data is even more important in the face of ML developments ad- 
vancing the capabilities of AI, which has widened the societal reach 
of data analysis. While cookies and other digital data traces allow 
for the predictive modeling of user data, i.e., informing conclusions 
and making predictions about those users, ML goes a step further. 
It can make predictions about users from indirect information, i.e., it 
is less dependent on data directly left about and by users. This is be- 
cause of its new analytic capacity to process language, images or other 
symbols. Computational approaches to analyzing these had long failed
until ML using so-called “deep neural networks” allowed
a breakthrough regarding the “calculation of meaning” (Bunz, 2019;
Cantwell Smith, 2021), meaning which users accidentally leave behind
when speaking, writing or appearing in photos or videos. Processing
these formats and calculating the meaning signified by them is a new
capacity of data analysis that substantially widens the data pool, as it
allows the analysis of user information to reach much further.
The effect of ML is therefore a profoundly deeper reach of digital technology
into the fabric of our societies, thereby affecting their social and
political processes. 

To gain this reach, large datasets featuring our audio, video, photos
or written texts are used to train ML systems, whereby the configuration
and quality of data play an essential role in training them correctly.
At the same time, there has been a lack of attention to datasets,
because in computer science their creation (e.g., that of
the ImageNet dataset; Deng et al., 2009) has long been valued less
than the making of algorithms or the building of models. The reason
for this is that gathering or acquiring a dataset is, strictly speaking, not 
a computational procedure. Many introductory books teaching ML in 
computer science assume datasets as already available (e.g., Alpaydin, 
2020: 154; Flach, 2012; Witten et al., 2011) making their creation an 
“invisible practice”. However, acquiring a dataset for training is fundamental
to the development of machine learning models, which is why critical
knowledge about the quality of data needs to become standard
practice, from the conception to the deployment of ML. While gathering data is not
a computational procedure, the actual workflow when constructing a
neural network to perform ML begins with the acquisition of a dataset,
as Jaton showed in great detail in his ethnographic study of a computer
science laboratory (2021: 54): for ML models, obtaining a dataset
is part of “the practical processes that enable them to come into existence”
(11). In other words, datasets are essential to train ML models;
an observation that in 2021 led Andrew Ng, Professor at Stanford Uni- 
versity Department of Computer Science and Electrical Engineering, 
to call for a more “data-centric AI” (Ng 2021). High-quality data, how- 
ever, is not sufficiently publicly available to ML developers, and this is 
often highlighted as one of the biggest issues in the field. The essential 
role described here for datasets and their quality regarding ML, and 
with that the even bigger importance datasets have come to play in the 
technical and political realities of our overdeveloped world, creates the 
need for a different approach toward data: an approach that needs to 
engage with the issues of critical data science (Iliadis & Russo, 2016)
in the face of the fact that processing data both creates opportunities and
takes them away. By revealing the absences, differences and disconnects within
datasets, we can address some of the sociocultural problems they create.
These issues show why a critical conceptualization that aims to make
data fairer and more transparent, available and accountable for the
community is needed, so that we can think of “data as a public good”.

The concept of “data as a public good” has been developed as a 
response to the massive deployment of data analytics by technology 
companies such as Google or Palantir. As Lane et al. (2014) point out 
in the introduction of Privacy, Big Data, and the Public Good, one of the
first books on this topic: 


Much has been made of the many uses of (...) data for pragmatic 
purposes, including selling goods and services, winning political 
campaigns, and identifying possible terrorists. Yet big data can also 
be harnessed to serve the public good in other ways: scientists can 
use new forms of data to do research that improves the lives of hu- 
man beings; federal, state, and local governments can use data to 
improve services and reduce taxpayer costs; and public organiza- 
tions can use information to advocate for public causes, for example. 

(Lane et al., 2014: XI) 


However, in an increasingly datafied world, the systemic and struc- 
tural inequities we described earlier are intensified and exacerbated 
by narrow conceptions of how datasets are produced, reproduced, 
combined and shared. Data structures and data processes such as the 
building of new datasets through other datasets, the combination of
data etc. (see Roberts et al., 2021) are invisible processes that impact
every decision that is taken based on their analysis. And these invisible
data processes, layered on existing systemic and structural inequities,
can have profound societal consequences. In other words, invisible
data processes, which leave data inaccessible, unstructured, unavailable,
misrepresented, incomplete or biased, often impact specific
populations and countries, and are a threat to the health and safety
of the global public. As Roberts points out, invisibility is “a metaphor
that figures a state of being that comes into existence when others re- 
fuse to see us, to acknowledge our existence, to accept our presence as 
making a contribution to a world of meaning” (Roberts, 1999: 121). He 
goes on to argue that invisibility is not just created systemically and 
structurally, but it is also sustained through the complicity of those 
who are invisibilized — and this is why data solidarity, as we are going 
to show, is so important. Applying this logic to the invisibilization of 
data, it becomes clear that if we act as if data processes are visible, we 
perpetuate this invisibilization and sustain the power structures that 
suppress and marginalize data and their societal impact. How can we
balance the fears of control over data and the public by Big Tech with the significance
of data for the betterment of sectors such as healthcare? This challenge
translates into the question: how can we do good with better and more
data? By now, several concepts aim to capture this different
political approach to data, ranging from data justice (Dencik et al.,
2016) and responsible data (van der Aalst et al., 2017) to the call for data
trusts (Delacroix, Pineau & Montgomery, 2021). To this, we would like
to add the concept of data solidarity and the need to overcome the invisibility
of data practice. In the following, we will demonstrate the need
for this through a case study. 


Case Study: On the Role of Datasets for Machine 
Learning Research 


To understand the importance of data processes and cut through 
their invisibility, we studied the role datasets have for ML research 
in healthcare, particularly the usage of patient data to train ML sys- 
tems. Taking advantage of the abundance of ML models being trained 
and developed within healthcare, we conducted a systematic literature 
search focused on medical diagnosis on arXiv; arXiv, hosted by Cor- 
nell University, was chosen as it plays a central role for the publication 
of research by the ML community (Balki et al., 2019). Established in
1991, the repository is generally a popular venue for preprints in
science, technology, engineering and mathematics (STEM) disciplines,
as it has a fast publishing turnaround, getting papers out before peer
review (Delfanti, 2016); the pace at which ML research develops has
created the need for researchers to get their findings out quickly. Our
systematic literature search focused on a very specific area, that of
ML models assisting with medical diagnosis. On arXiv, 82 relevant
studies were identified by searching “machine learning”, “medical”,
“diagnostics”. One duplicate was removed with the use of reference
management software. The remaining papers were included if they
met the criterion of describing an ML experiment in a scientific paper
that involved processing medical data entries. This led to a corpus
of 62 papers published between 2009 and 2021 that were analyzed in
detail regarding their usage of data. Our aim was to learn more about
the medical datasets used when training and validating ML models,
a process that is partly invisibilized: while datasets are mentioned,
their creation is often treated as negligible. The focus was therefore
on the origin and the creation of a dataset, including the gathering
and (in some cases, for supervised learning) the labeling of data,
information that at times is communicated in the margins (through
acknowledgements, affiliations, etc.). Cleaning of existing datasets was
not taken into account. Datasets mentioned in the papers were coded 
according to three categories: Code N for newly created datasets; code 
L for datasets that had to be labeled by medical experts to allow for 
supervised learning; code P for publicly available datasets. 

We found that over half of the experiments, 33 papers, worked 
with publicly available datasets, i.e., medical datasets that have been 
published to foster research such as the National Institutes of Health 
Chest X-Ray Dataset published by the National Library of Medi- 
cine in the US, or the Alzheimer’s Disease Neuroimaging Initiative 
(ADNI). Six further experiments used datasets of mixed status, i.e., 
some were publicly available while others were specifically created for 
the study. This procedure reflects the process of training an ML system,
which runs through two or three interlinked phases, each needing a separate
dataset: the phase of training the ML model (1) and the phase of testing
the model (2b); some also validate the model with a step in the middle
that adjusts parameters further (2a). About a third, 21 experiments, created
their own dataset from the ground up, all but one through a close
collaboration with a medical institution. 
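
The phases named above can be illustrated in code. The following minimal Python sketch is an illustration only and is not taken from any of the reviewed papers: it shuffles a list of records and cuts it into three non-overlapping subsets for training (1), validation (2a) and testing (2b). The function name, the 70/15 proportions and the record identifiers are assumptions chosen for the example.

```python
import random


def three_way_split(records, train=0.7, validation=0.15, seed=0):
    """Shuffle records and cut them into training, validation and test subsets.

    The proportions are illustrative assumptions; whatever is left after the
    training and validation cuts becomes the test subset.
    """
    if not 0 < train < 1 or not 0 <= validation < 1 or train + validation >= 1:
        raise ValueError("fractions must leave room for a test subset")
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    n_train = int(len(shuffled) * train)
    n_validation = int(len(shuffled) * validation)
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_validation],
        shuffled[n_train + n_validation:],
    )


# Hypothetical identifiers standing in for medical data entries.
records = [f"entry-{i}" for i in range(100)]
train_set, validation_set, test_set = three_way_split(records)
```

Because the three subsets never share a record, a model is always tested on entries it has not seen during training, which is the reason each phase needs its own data.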

Even though the findings of this systematic review are not repre- 
sentative, they clearly show a strong tendency within ML research: 
The majority of experiments, 33 out of 62 papers, used publicly avail- 
able datasets. Adding the six experiments that made use of available 
datasets while enriching them with newly created ones, one could
come to the conclusion that 39 papers, i.e., 63% of the papers we reviewed,
worked with available datasets. Given the fact that publicly
available datasets are rare, this clearly shows the extent to which datasets
incentivize and influence the conducting of ML research: they
are obviously needed. And this is the case for academic research as
well as for businesses. Among our body of 62 papers were six in which
businesses led the research or were part of it: some big ones such as
Google Brain or Microsoft Research plus a range of less well-known,
smaller companies. Most of them were working with available datasets:
among the 39 papers using publicly available datasets, five were
conducted by businesses or in collaboration with businesses. Only one
paper, for which academics collaborated with the British company 
Babylon Health, used a newly created dataset, most likely one Babylon 
Health held internally. 
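
The proportions reported in this case study can be recomputed from the counts given in the text. The short Python sketch below does so; the group labels and the helper name are our own shorthand. Note that the three reported groups (33, 6 and 21 papers) add up to 60 of the 62 papers, so the sketch reports the remainder as unassigned instead of guessing where it belongs.

```python
TOTAL_PAPERS = 62

# Counts as reported in the text; the labels are shorthand for this example.
counts = {
    "public only": 33,
    "mixed (public plus newly created)": 6,
    "newly created only": 21,
}


def share(count, total=TOTAL_PAPERS):
    """Return the percentage of the corpus, rounded to a whole number."""
    return round(100 * count / total)


public_or_mixed = counts["public only"] + counts["mixed (public plus newly created)"]
unassigned = TOTAL_PAPERS - sum(counts.values())

for label, count in counts.items():
    print(f"{label}: {count} papers, about {share(count)}%")
print(f"any public data: {public_or_mixed} papers, about {share(public_or_mixed)}%")
print(f"papers not covered by the three counts: {unassigned}")
```

Run as written, the combined share of papers drawing on public data comes to about 63%, which matches the figure stated above.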

The demand for publicly available datasets clearly shows their po- 
tential. Datasets strongly incentivize both academic and commercial 
research. Despite the talk of big data, however, they are scarce:
a platform such as Kaggle, which allows users to find and publish datasets
and was bought by Google in 2017, lists 50,000 datasets for more
than one million active users. This indicates that in 2021, too many
users conducting data analysis research worked with the same datasets,
which our analysis confirmed. A dataset from the Alzheimer’s
Disease Neuroimaging Initiative (ADNI) was used five times in papers
from Russia, France, the US, Pakistan and China. Overall, publicly
available datasets such as ADNI or the chest X-ray datasets published
by the National Institutes of Health and others led to multiple papers
using them. Papers frequently mentioned that “progress has been hin- 
dered by a sparsity of available training data, commonly attributed 
to the difficulty of publishing datasets” (McManigle et al., 2020: 1) 
or noted that “in domains where data is highly regulated and expert 
time is rare, it can be exceedingly cumbersome to obtain new expert- 
labeled data sets every time a model needs to be improved” (Cai et al., 
2019: 12). As Roberts et al. (2021) have also pointed out, the need for 
public data leads to serious issues for research. More and more da- 
tasets are “assembled from other datasets and redistributed under a 
new name”. These “Frankenstein datasets” may inadvertently include 
overlapping or identical datasets, which, in turn, lead algorithms to 
wrong diagnoses and suggestions. 
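
One way to make such overlaps visible is to compare content fingerprints of the records in two datasets before combining them. The Python sketch below is a hypothetical illustration of that idea, not a method described by Roberts et al.; the function names and the byte strings standing in for medical records are assumptions. It only detects byte-for-byte identical records, so a re-encoded or cropped copy of the same image would go unnoticed.

```python
import hashlib


def fingerprint(record: bytes) -> str:
    """Return a SHA-256 digest that identifies a record by its exact content."""
    return hashlib.sha256(record).hexdigest()


def overlap(dataset_a, dataset_b):
    """Return the fingerprints of records that occur in both datasets."""
    return {fingerprint(r) for r in dataset_a} & {fingerprint(r) for r in dataset_b}


# Hypothetical byte strings standing in for the contents of medical records.
source_dataset = [b"scan-001", b"scan-002", b"scan-003"]
redistributed_dataset = [b"scan-002", b"scan-003", b"scan-104"]

shared = overlap(source_dataset, redistributed_dataset)
print(f"{len(shared)} records appear in both datasets")
```

A check of this kind matters most when one of the two datasets is used for training and the other for testing, since shared records would make a model look more accurate than it is.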

The scarcity of data and the invisible data processes that produce
datasets lead to working with unbalanced datasets, an issue that impacts
not just the medical but all sectors, and with it, society. While
data is abundant, the majority of datasets are proprietary and built for
commercial reasons with no oversight. At the same time, as demon- 
strated by the high number of Kaggle users compared to the low num- 
ber of public datasets, publicly available datasets are generally scarce. 
While the issue is known, the low regard for the creation of datasets,
which, as we have shown, is often not seen as an act of computer engineering
and is not taught in introductory ML books, makes the need for
public datasets all the more pressing.

This is where an approach foregrounding the democratic value of
data, and an initiative to create datasets and make them publicly available
as a gesture of solidarity, could help. This is even more the case
as, in current debates, the focus on the collective value datasets have
for society is often missing (Delacroix & Montgomery, 2020; Viljoen,
2021). This is worrying, as data analysis, driven further by ML, has
become a process people experience directly or indirectly every day:
when shopping on the internet, when using government services or
when applying for a loan or insurance. As long as these data analysis
decisions are based on commercial datasets without checks and
balances and to which there is no alternative, there will be issues of
bias and fairness leading to a lack of trust. The importance of taking
the collective value of data into account has been demonstrated by
Salome Viljoen (2021). In her in-depth report on the issue, titled “Democratic
Data”, she correctly reminds readers:


The data economy has resulted in massive collection of informa- 
tion regarding consumer purchasing preferences and social net- 
works, for instance, but has contributed comparatively little to 
ongoing discussions concerning waste production, water usage, 
or how wealth from financial instruments flows globally. 

(649) 


With the understanding of big data as something mainly useful for 
business, data to support our democratic public infrastructures 
needs further strengthening. Admittedly, the change needed here is 
not just infrastructural, it is also political. A democratic use of data 
could tackle bias in datasets and handle it more transparently; it could 
turn toward opportunities such as programmes to create datasets in 
under-researched areas that are socially relevant or help us understand 
niche issues that have been consistently ignored due to lack of corpo- 
rate or government interest. As Viljoen points out: “Datafication is 
not only unjust because data extraction or resulting datafied govern- 
mentality may violate individual autonomy; datafication may also be 
unjust because it violates ideals of social equality” (58). Viljoen calls 
for a shift in the understanding of data “from an individual medium 
expressing individual interests, to a democratic medium that materi- 
alizes population-level, social interests” (54). This would also mean 
the following: 


- data should not only be gathered where it naturally occurs;
instead, governments need to start initiating the collection of datasets
to ensure democratic values;

- datasets could be used to allow citizens a better representation in 
the conditions and purposes of data production; 

- issues with bias in datasets can be targeted or made transparent; 

- datasets could be used to incentivize ML research in particular 
areas attractive from a societal and not commercial perspective. 


These points, however, depend on the availability of data and the 
willingness of citizens to embrace data sharing for the public good. 
Naturally, the gaining of data for the public good operates differently
from the commercial top-down approach leading to data extraction.
Instead, it must evolve from a participatory understanding of data
sharing and a belief in “data commons”. This needs communicative
work. As Dulong de Rosnay and Stalder (2020: 16) have pointed out:


The constitution of data commons (...) needs to overcome the 
apparent contradiction between personal data and property, and 
between privacy and open access, as a personal data commons 
would not lead to sharing personal information, but to govern 
their reuse according to values of the digital commons. 


This brings the importance of solidarity, exercised by giving data to 
support the community, to the fore. 


Toward Data Solidarity 


In order to develop not just fair and transparent but also democratic 
and visible data processes, we propose that we need to cultivate and 
sustain a culture of solidarity in data sharing processes. Solidarity has
functioned as a key principle in democratic struggles of the past, such as
those of the Polish labor union Solidarność in the 1980s and the mid-19th-century
French workers’ fight against oppression (Wilde, 2013), and in the most
recent past, in social movements such as Occupy (Vrikki, 2018). In social
movements, solidarity visibilizes and materializes values such as
trust, openness and common principles (Pavan & della Porta, 2020).
In the data era we currently experience, living with others and the
social construction of our societies have given solidarity a wider role
that does not just hold political importance; it can also be perceived
as a form of caring and protecting others (Chatzidakis et al., 2020). At
the same time, this can build on interpretations of solidarity in social 
theory where one finds, on the one hand, interpretations that perceive 
solidarity as the sum of norms contributing to social cohesion, e.g., 
in the works of Émile Durkheim (1984, 2001), and on the other hand,
interpretations that conceive of solidarity as a relationship between
members of a group with common interests, referring to the
works of Marx (1906) and Weber (1978). Beyond social theory, political
philosopher Scholz (2008) has identified three kinds of solidarity:
social solidarity (describing the relationships within a group),
civic solidarity (referring to the relationship between citizens and the
state) and political solidarity (expressing the commitment and morals
of the individual), a division based on the relationships
on which solidarity depends. The variety of approaches within social
and political theory shows how ingrained solidarity is in our social,
political and cultural lives. In everyday life, it is often translated
as the process of supporting the vulnerable, as acts of public caring
such as education, welfare and healthcare, and as the primary care
relations we build and sustain through friendships, households and
families (Lynch, 2007). 

Building on these interpretations and approaches, we identify data 
solidarity as an articulation of visibilizing data processes for the benefit 
of the public good. The proposition here is to perceive data solidarity in a
productive opposition to current hierarchical data structures as well 
as to the latent processes of the neoliberal market, personal respon- 
sibility and individual agency (Cohen, 2010). This is pertinent to the 
conceptualization of data processes as a set of democratic norms that 
together reinforce the capacity of communities to produce collective 
goods for the public benefit (Laitinen & Pessi, 2014). Recent critical 
studies into the democratization of AI, for example by Himmelreich
(2021), have stressed that the matter is complicated and that there is no
simple administrative panacea for the injustices that are perpetuated
by AI. Attention to the ways in which democratic governance of
AI can be initiated and structured is still underdeveloped. For
these reasons, we propose “data solidarity” as a value supporting a
process to enhance our AI futures, in the same way that solidarity between
the working class and farmers resulted in the establishment of a universal
pension system (Baldwin, 1990). Data solidarity can advance the
inclination of corporate and public data stakeholders to share both
the risks and the benefits of data access, production and sharing. The
term solidarity is “sometimes used as a nebulous concept” (Stjernø,
2009: 2), but data solidarity can most usefully be defined as the
willingness to share datasets and resources with others while acknowledging
the invisible processes that take place during the creation, production
and sharing of datasets. Visibilizing those processes and their
flaws, which may result in marginalizations such as racism, sexism and
classism, accentuates the need for collective action based
on the values of solidarity.


Conclusion: Moving from Big to Democratic Data 


In the same way in which our political and financial systems have
determined so much of our behaviors and societies, data analytics are
stretching, and will keep stretching, our cultures and democracy. In this chapter,
we aim to answer this challenge by making the political force of data 
practices visible. Our argument positions itself as an addition to the 
ongoing debate about critical data practice, which aims “to account 
for, inventively respond to and intervene around the socio-technical 
infrastructures involved in the creation, extraction and analysis of 
data” (Gray et al., 2018: 8). Our research also builds on recent insights
into collective aspects regarding datasets (Delacroix, Pineau
& Montgomery, 2021; Viljoen, 2021), insights that (a) are gained from
the collective, i.e., from relations between data entries, and (b) could
be processed for the collective, advancing the public good. To advance
this, the tendency to shroud data practice in invisibility needs to end. 
To move from big to democratic data, we need to understand datasets 
and data infrastructure as democratic tools which can advance soci- 
etal interests and assist with bringing forth elements of public good 
for populations. How influential publicly available datasets are could
be seen in our case study of medical diagnosis through ML systems
trained on medical data. To encourage the building of such publicly
available datasets, we need a new notion of data: next to the understandable
fear about surveillance through the extraction of data, we
need to stress the potential that data sharing has in public hands and
move toward data solidarity. While there is no simple answer to the
question “how can we do good with better and more data?”, we know
that ultimately it boils down to collective action. By deploying solidarity
as a principle of data governance for the creation of publicly
held datasets, we can start building trust and accountability. Digital
technologies, AI systems such as ML and other advanced data
analytics can help us better our societies if we deploy principles of
critical data practice that visibilize data processes and apply a critical 
approach to datasets aiming for the inclusion of different kinds of 
data. As we stand at the precipice of datafied democracy, now is an 
opportunity for a steady refocus on how data and data infrastructure 
can support inclusion. The data infrastructures we shape, shape us 
in return. The rise of AI has made these infrastructures even more 
important. To shape these infrastructures according to democratic 
values, the principle of data solidarity is essential. 

We would like to express our thanks to Shuprima Guha, Jonathan 
Gray and Adam Bull for their useful comments and corrections. 